Learning near-optimal policies with fitted policy iteration and a single sample path
Authors
Abstract
In this paper we consider the problem of learning a near-optimal policy in continuous-space, expected total discounted-reward Markovian Decision Problems using approximate policy iteration. We consider batch learning, where the training data consists of a single sample path of a fixed, known, persistently exciting stationary stochastic policy. We derive PAC-style bounds on the difference between the performance of the policy returned by the algorithm and the optimal value function, in both L∞ and weighted L-norms.
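The setting the abstract describes (approximate policy iteration fitted on one batch trajectory) can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it uses a toy finite problem and a tabular average in place of a generic regressor, and all names are illustrative.

```python
import numpy as np

def fitted_policy_iteration(path, n_states, n_actions, gamma=0.9, n_iter=20):
    """Approximate policy iteration from a single sample path.

    `path` is a list of (s, a, r, s') transitions collected under a fixed,
    persistently exciting behaviour policy. A tabular average stands in
    for the generic function approximator fitted on the batch.
    """
    policy = np.zeros(n_states, dtype=int)  # arbitrary initial greedy policy
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iter):
        # fitted policy evaluation: repeatedly regress Q^pi onto Bellman targets
        for _ in range(50):
            targets = np.zeros((n_states, n_actions))
            counts = np.zeros((n_states, n_actions))
            for s, a, r, s2 in path:
                targets[s, a] += r + gamma * Q[s2, policy[s2]]
                counts[s, a] += 1
            Q = np.where(counts > 0, targets / np.maximum(counts, 1), Q)
        # policy improvement: act greedily with respect to the fitted Q
        policy = Q.argmax(axis=1)
    return policy, Q

# toy two-state problem: action 1 always pays reward 1, action 0 pays 0,
# and the next state is uniform; the behaviour policy is uniformly random
rng = np.random.default_rng(0)
path, s = [], 0
for _ in range(2000):
    a = int(rng.integers(2))
    s2 = int(rng.integers(2))
    path.append((s, a, float(a), s2))
    s = s2

policy, Q = fitted_policy_iteration(path, n_states=2, n_actions=2)
print(policy)  # the greedy policy should pick action 1 in both states
```

The single-path aspect is that every evaluation step reuses the same batch of transitions; the persistent excitation of the behaviour policy is what guarantees all state-action pairs appear in the regression targets.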
Similar resources
Optimal Sample Selection for Batch-mode Reinforcement Learning
We introduce the Optimal Sample Selection (OSS) meta-algorithm for solving discrete-time Optimal Control problems. This meta-algorithm maps the problem of finding a near-optimal closed-loop policy to the identification of a small set of one-step system transitions, leading to high-quality policies when used as input of a batch-mode Reinforcement Learning (RL) algorithm. We detail a particular i...
Ensemble Usage for More Reliable Policy Identification in Reinforcement Learning
Reinforcement learning (RL) methods employing powerful function approximators like neural networks have become an interesting approach for optimal control. Since they learn a policy from observations, they are also applicable when no analytical description of the system is available. Although impressive results have been reported, their handling in practice is still hard, as they can fail at re...
Online Reinforcement Learning for Real-Time Exploration in Continuous State and Action Markov Decision Processes
This paper presents a new method to learn online policies in continuous state, continuous action, model-free Markov decision processes, with two properties that are crucial for practical applications. First, the policies are implementable with a very low computational cost: once the policy is computed, the action corresponding to a given state is obtained in logarithmic time with respect to the...
Near-Minimum-Time Motion Planning of Manipulators along Specified Path
The large amount of computation necessary for obtaining a time-optimal solution for moving a manipulator along a specified path has made it impossible to introduce an on-line time-optimal control algorithm. Most of this computational burden is due to the calculation of switching points. In this paper a learning algorithm is proposed for finding the switching points. The method, which can be used for both ...
On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error ε at each iteration, it is well known that one can compute stationary policies that are 2γ/(1−γ)² ε-optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iter...
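The guarantee this abstract refers to is the classical error-propagation bound for approximate value and policy iteration. A standard statement (notation mine, not taken from the truncated abstract), for greedy policies π_k produced with per-iteration error ε and discount factor γ, is:

```latex
\limsup_{k \to \infty} \left\lVert V^{*} - V^{\pi_k} \right\rVert_{\infty}
  \;\le\; \frac{2\gamma}{(1-\gamma)^{2}}\,\varepsilon
```

The (1−γ)² factor in the denominator is what makes this bound loose for discount factors close to 1, which motivates the variations the abstract goes on to develop.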
Publication date: 2005